FINAL PROJECT-MACHINE LEARNING

Libraries And Importing Data Set

In [1]:
!pip3 install plotly
!pip install jupyter_contrib_nbextensions
Requirement already satisfied: plotly in c:\users\user\anaconda3\anaconda3\lib\site-packages (4.14.3)
Requirement already satisfied: jupyter_contrib_nbextensions in c:\users\user\anaconda3\anaconda3\lib\site-packages (0.5.1)
(remaining pip output truncated: all requirements already satisfied)
In [2]:
!pip3 install folium
!jupyter nbextension install <url>/toc2.zip --user
!jupyter nbextension enable toc2/main
Requirement already satisfied: folium in c:\users\user\anaconda3\anaconda3\lib\site-packages (0.12.1)
(remaining pip output truncated: all requirements already satisfied)
The system cannot find the file specified.
Enabling notebook extension toc2/main...
      - Validating: ok
In [3]:
## Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import folium
from folium.plugins import HeatMap
## Pandas dataframe library
import pandas as pd
## NumPy library
import numpy as np
## Train/test split, cross-validation, grid search and K-fold
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold, KFold
## Evaluation metrics (accuracy, confusion matrix, classification report, ROC)
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve, auc
## Normalization
from sklearn.preprocessing import MinMaxScaler
## Models
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
In [4]:
df=pd.read_csv('feature_data.csv') #reading the file of data
label_df=pd.read_csv('label_data.csv') #reading the file of labels

Exploratory Analysis

Exploratory analysis is a term used in data analysis for exploring the data and extracting useful information from it.
It is carried out once all the data has been collected, cleaned and processed; by manipulating the data you determine whether it already contains the information you need or whether more is required.
During this step, various techniques in Python (such as functions and plots) can be used to understand the data, so that it can be interpreted more effectively and better conclusions drawn according to the requirements.

Merging both dataframes

In [5]:
df['cancelation'] = label_df['cancelation'] #copy the "cancelation" column from the label file into the data dataframe
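The assignment above relies on both CSV files sharing the same row order; an explicit index join makes that alignment assumption visible. A minimal sketch on toy stand-ins for the two files (values hypothetical):

```python
import pandas as pd

# Toy stand-ins for feature_data.csv and label_data.csv (hypothetical values).
df = pd.DataFrame({'adults': [2, 1, 2]})
label_df = pd.DataFrame({'cancelation': [True, False, False]})

# Direct assignment aligns on the index, so it is only safe when both
# files were written with the same row order.
df['cancelation'] = label_df['cancelation']

# An explicit index join makes the alignment assumption visible.
merged = df[['adults']].join(label_df['cancelation'])
```

Both approaches give the same result here because the two toy frames share a default RangeIndex, which mirrors the situation after reading both CSV files in order.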

Checking the new column:

In [6]:
df.head() 
Out[6]:
Unnamed: 0 time_until_order order_year order_month order_week order_day_of_month adults children babies country ... anon_feat_5 anon_feat_6 anon_feat_7 anon_feat_8 anon_feat_9 anon_feat_10 anon_feat_11 anon_feat_12 anon_feat_13 cancelation
0 51014 309.0 2016 May week_20 13 2 0.0 0 PRT ... 0.0 215.0 0.0 0 0.0 0.250606 17.588299 True 1.0 True
1 28536 3.0 2016 October week_41 2 2 0.0 0 ESP ... 3.0 0.0 1.0 1 1.0 0.444719 2.343371 True NaN False
2 21745 NaN 2017 March week_12 19 1 0.0 0 DEU ... 4.0 0.0 0.0 0 1.0 0.598733 2.498820 True NaN False
3 17502 153.0 2015 September week_40 29 2 0.0 0 GBR ... 3.0 0.0 0.0 0 1.0 0.335675 12.411559 True NaN False
4 83295 33.0 2016 January week_5 25 2 0.0 0 BRA ... 0.0 15.0 0.0 0 0.0 0.492874 5.743378 True NaN False

5 rows × 35 columns

Dimensionality of the Data

In [7]:
print("Number of rows:", df.shape[0]) #axis 0 is the rows
print("Number of columns:", df.shape[1]) #axis 1 is the columns
Number of rows: 89542
Number of columns: 35

Another way of checking the dimensionality of the data:

In [8]:
df.shape
Out[8]:
(89542, 35)

Exploring columns

Checking columns names in the data file:

In [9]:
df.columns
Out[9]:
Index(['Unnamed: 0', 'time_until_order', 'order_year', 'order_month',
       'order_week', 'order_day_of_month', 'adults', 'children', 'babies',
       'country', 'order_type', 'acquisition_channel', 'prev_canceled',
       'prev_not_canceled', 'changes', 'deposit_type', 'agent', 'company',
       'customer_type', 'adr', 'anon_feat_0', 'anon_feat_1', 'anon_feat_2',
       'anon_feat_3', 'anon_feat_4', 'anon_feat_5', 'anon_feat_6',
       'anon_feat_7', 'anon_feat_8', 'anon_feat_9', 'anon_feat_10',
       'anon_feat_11', 'anon_feat_12', 'anon_feat_13', 'cancelation'],
      dtype='object')

Exploring the type of columns:

In [10]:
df.dtypes
Out[10]:
Unnamed: 0               int64
time_until_order       float64
order_year               int64
order_month             object
order_week              object
order_day_of_month       int64
adults                   int64
children               float64
babies                   int64
country                 object
order_type              object
acquisition_channel     object
prev_canceled            int64
prev_not_canceled        int64
changes                float64
deposit_type            object
agent                  float64
company                float64
customer_type           object
adr                    float64
anon_feat_0            float64
anon_feat_1              int64
anon_feat_2              int64
anon_feat_3              int64
anon_feat_4              int64
anon_feat_5            float64
anon_feat_6            float64
anon_feat_7            float64
anon_feat_8              int64
anon_feat_9            float64
anon_feat_10           float64
anon_feat_11           float64
anon_feat_12              bool
anon_feat_13           float64
cancelation               bool
dtype: object

Information about the data:

In [11]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89542 entries, 0 to 89541
Data columns (total 35 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           89542 non-null  int64  
 1   time_until_order     76861 non-null  float64
 2   order_year           89542 non-null  int64  
 3   order_month          86108 non-null  object 
 4   order_week           89542 non-null  object 
 5   order_day_of_month   89542 non-null  int64  
 6   adults               89542 non-null  int64  
 7   children             89538 non-null  float64
 8   babies               89542 non-null  int64  
 9   country              85201 non-null  object 
 10  order_type           89542 non-null  object 
 11  acquisition_channel  89542 non-null  object 
 12  prev_canceled        89542 non-null  int64  
 13  prev_not_canceled    89542 non-null  int64  
 14  changes              86065 non-null  float64
 15  deposit_type         80536 non-null  object 
 16  agent                77346 non-null  float64
 17  company              5062 non-null   float64
 18  customer_type        79647 non-null  object 
 19  adr                  86559 non-null  float64
 20  anon_feat_0          86161 non-null  float64
 21  anon_feat_1          89542 non-null  int64  
 22  anon_feat_2          89542 non-null  int64  
 23  anon_feat_3          89542 non-null  int64  
 24  anon_feat_4          89542 non-null  int64  
 25  anon_feat_5          85510 non-null  float64
 26  anon_feat_6          85309 non-null  float64
 27  anon_feat_7          85294 non-null  float64
 28  anon_feat_8          89542 non-null  int64  
 29  anon_feat_9          85811 non-null  float64
 30  anon_feat_10         86810 non-null  float64
 31  anon_feat_11         84585 non-null  float64
 32  anon_feat_12         89542 non-null  bool   
 33  anon_feat_13         5776 non-null   float64
 34  cancelation          89542 non-null  bool   
dtypes: bool(2), float64(14), int64(12), object(7)
memory usage: 22.7+ MB
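The `info()` output shows large gaps (e.g. `company` has only 5062 non-null values out of 89542). A quick way to rank columns by their percentage of missing values, sketched on a toy frame with hypothetical values:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps, standing in for df (values hypothetical).
toy = pd.DataFrame({
    'company': [np.nan, np.nan, 12.0, np.nan],
    'agent':   [1.0, np.nan, 3.0, 4.0],
    'adults':  [2, 2, 1, 3],
})

# Fraction of nulls per column, as a percentage, worst columns first.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
```

Applied to the real `df`, this would surface `company` and `anon_feat_13` as the sparsest columns.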

Summary of data

In [12]:
df.describe() #describing the data
Out[12]:
Unnamed: 0 time_until_order order_year order_day_of_month adults children babies prev_canceled prev_not_canceled changes ... anon_feat_3 anon_feat_4 anon_feat_5 anon_feat_6 anon_feat_7 anon_feat_8 anon_feat_9 anon_feat_10 anon_feat_11 anon_feat_13
count 89542.000000 76861.000000 89542.000000 89542.000000 89542.000000 89538.000000 89542.000000 89542.000000 89542.000000 86065.000000 ... 89542.000000 89542.000000 85510.000000 85309.000000 85294.000000 89542.000000 85811.000000 86810.000000 84585.000000 5776.000000
mean 59716.762871 103.673879 2016.157658 15.828807 1.857497 0.103732 0.007896 0.087411 0.137701 0.223877 ... 0.032231 0.989971 1.330944 2.339401 0.062607 0.571922 0.335691 0.427146 8.845679 0.365132
std 34495.242240 106.940156 0.707461 8.779753 0.565296 0.397797 0.095194 0.849799 1.496269 0.663361 ... 0.176613 1.698086 1.879927 17.516854 0.243415 0.793567 0.472234 0.128140 5.236673 0.481509
min 0.000000 0.000000 2015.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.161008 0.038632 0.000000
25% 29838.250000 18.000000 2016.000000 8.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.328012 4.452191 0.000000
50% 59743.500000 69.000000 2016.000000 16.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.425622 8.422255 0.000000
75% 89610.500000 159.000000 2017.000000 23.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 3.000000 3.000000 0.000000 0.000000 1.000000 1.000000 0.511077 12.712815 1.000000
max 119388.000000 737.000000 2017.000000 31.000000 55.000000 10.000000 10.000000 26.000000 72.000000 21.000000 ... 1.000000 9.000000 11.000000 391.000000 3.000000 5.000000 1.000000 0.907525 27.172399 1.000000

8 rows × 26 columns

  • The mean of time_until_order is about 103 days, while the median is 69 days.
  • Each booking has on average 1.857 adults, 0.104 children and 0.008 babies.
  • 37.07% of the bookings were cancelled.
  • The mean order day of the month is 15.83.

Booking cancellation

Checking how many bookings were cancelled:

In [13]:
df['cancelation'].value_counts() 
Out[13]:
False    56346
True     33196
Name: cancelation, dtype: int64
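The cancellation percentage quoted in the summary above can be recomputed directly from these class counts:

```python
# Class counts taken from the value_counts() output above.
cancelled, not_cancelled = 33196, 56346
total = cancelled + not_cancelled       # 89542 bookings in total
rate = 100 * cancelled / total
print(round(rate, 2))  # → 37.07
```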

In this pie graph we can see the percentage of bookings that were cancelled and bookings that were not:
True indicates a cancelled booking
False indicates a booking that was not cancelled

In [14]:
cancellation_graph = px.pie(df, values=df['cancelation'].value_counts().values, names=df['cancelation'].value_counts().index,
             title='Cancelation' , color_discrete_sequence=px.colors.sequential.Peach)
cancellation_graph.update_traces(textposition='inside', textinfo='percent+label')
cancellation_graph.show()

Order's types

Checking the number of orders of each type:

In [15]:
df['order_type'].value_counts()
Out[15]:
Online TA        42450
Offline TA/TO    18154
Groups           14762
Direct            9487
Corporate         3957
Complementary      560
Aviation           170
Undefined            2
Name: order_type, dtype: int64

In this pie graph we see the percentage of each order type:
We can notice that most orders are made through "Online TA"

In [16]:
order_types = px.pie(df, values=df['order_type'].value_counts().values, names=df['order_type'].value_counts().index,
             title='The type of Orders' ,color_discrete_sequence=px.colors.sequential.Mint
            )
order_types.update_traces(textposition='inside', textinfo='percent+label')
order_types.show()

Countries with the highest bookings

Checking which country's citizens book the most:

In [17]:
df['country'].value_counts().head(1)
Out[17]:
PRT    34804
Name: country, dtype: int64

Top 10 countries:

In [18]:
df['country'].value_counts().head(10)
Out[18]:
PRT    34804
GBR     8676
FRA     7448
ESP     6170
DEU     5280
ITA     2654
IRL     2412
BEL     1713
BRA     1605
USA     1523
Name: country, dtype: int64

Bar plot showing the number of bookings from the top 10 countries:

In [19]:
fig = go.Figure(data=[go.Bar(
            x=df['country'].value_counts().index[0:10], y=df['country'].value_counts().values[0:10],
            text=df['country'].value_counts().values[0:10],
            textposition='outside',marker_color='lightseagreen'
        )])

fig.show()

Pie plot showing the share of bookings by country:

In [20]:
countries = px.pie(df, values=df['country'].value_counts().values, names=df['country'].value_counts().index,
             title='Countries' ,color_discrete_sequence=px.colors.sequential.RdPu)
countries.update_traces(textposition='inside', textinfo='percent+label')
countries.show()

Countries with the most guests

Creating a new feature that gives us the total number of guests (adults, children and babies):

In [21]:
df['guest'] = df['adults'] + df['children'] + df['babies']

Showing the top 5 countries with guests that did not cancel their booking:

In [22]:
#cancelation == False in order to fetch only the orders that were not cancelled
guest_per_country = df[df['cancelation'] == False]['country'].value_counts().reset_index()
guest_per_country.columns = ['country', 'guest'] #note: value_counts() counts bookings per country, not individual guests
guest_per_country.head()
Out[22]:
country guest
0 PRT 15060
1 GBR 6903
2 FRA 6084
3 ESP 4630
4 DEU 4369
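As noted above, `value_counts()` counts bookings per country rather than summing the `guest` column; if actual guest totals are wanted, a groupby sum could be used instead. A minimal sketch on toy data (all values hypothetical):

```python
import pandas as pd

# Toy bookings (hypothetical): country, guests on the booking, cancellation flag.
toy = pd.DataFrame({
    'country':     ['PRT', 'PRT', 'GBR', 'GBR', 'FRA'],
    'guest':       [2, 3, 1, 2, 4],
    'cancelation': [False, False, False, True, False],
})

kept = toy[toy['cancelation'] == False]          # non-cancelled bookings
bookings = kept['country'].value_counts()        # bookings per country
guests = kept.groupby('country')['guest'].sum()  # actual guests per country
```

In the toy data PRT has 2 kept bookings but 5 guests, which shows why the two aggregations can rank countries differently.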

A world map showing the number of guests per country:
Yellow indicates the largest number of guests

In [23]:
guest_map = px.choropleth(guest_per_country, locations = guest_per_country['country'], color = guest_per_country['guest'],
                          hover_name = guest_per_country['country'])
guest_map.show()

Month with the most orders

Bar plot indicating the number of orders that were not cancelled in each month.
We can see that August and July have the most orders; since this is the summer vacation, most families prefer to travel then.

In [24]:
month_order = df[df['cancelation'] == False]['order_month'].value_counts().reset_index()
# Fetching only the orders that were not cancelled
# x axis: the month; y axis: the number of orders
graph_order_month = go.Figure(data=[go.Bar(x=month_order['index'], y=month_order['order_month'],
            text=month_order['order_month'], textposition='outside', marker_color='green')])
graph_order_month.show()
In [25]:
month_order = df[df['cancelation'] == False]['order_month'].value_counts()
order_and_month = px.pie(df, values=month_order.values, names=month_order.index,
             title='Orders and months' ,color_discrete_sequence=px.colors.sequential.BuGn)
order_and_month.update_traces(textposition='inside', textinfo='percent+label')
order_and_month.show()

We can see in this bar plot that the highest ADR is in August, and that most cancelations also happen in that month; this suggests that the high ADR may be a reason for the cancelations.
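The claim above can also be checked numerically with a groupby; a minimal sketch on toy data (ADR values and flags hypothetical):

```python
import pandas as pd

# Toy bookings (hypothetical ADR values and cancellation flags).
toy = pd.DataFrame({
    'order_month': ['August', 'August', 'July', 'July'],
    'adr':         [200.0, 180.0, 120.0, 110.0],
    'cancelation': [True, True, False, True],
})

# Mean ADR and cancellation rate per month; booleans average to rates.
summary = toy.groupby('order_month').agg(
    mean_adr=('adr', 'mean'),
    cancel_rate=('cancelation', 'mean'),
)
```

Running the same aggregation on the real `df` would give one row per month, making it easy to see whether the high-ADR months are also the high-cancellation months.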

In [26]:
plt.figure(figsize=(15,10)) #size of graph
sns.barplot(x='order_month', y='adr', hue='cancelation', palette= 'summer', data=df)
plt.title('Order Month vs ADR vs Booking Cancellation Status')
Out[26]:
Text(0.5, 1.0, 'Order Month vs ADR vs Booking Cancellation Status')

Type of customers that cancel their booking

Crosstab showing the number of cancellations per customer type:
We can see that Transient customers cancel the most

In [27]:
pd.crosstab([df["cancelation"]], df["customer_type"],margins = True).style.background_gradient(cmap = "gist_gray")
Out[27]:
customer_type Contract Group Transient Transient-Party All
cancelation
False 1834 355 35516 12386 50091
True 837 37 24463 4219 29556
All 2671 392 59979 16605 79647

Catplot showing cancelled and non-cancelled bookings by customer type:

In [28]:
sns.catplot(x='customer_type', col = 'cancelation', data=df, kind = 'count', palette='Set2') #countplot
Out[28]:
<seaborn.axisgrid.FacetGrid at 0x16fc49a2820>

Deposit type effect on cancellation

Crosstab showing the number of cancellations per deposit type:
We can see that most of the cancellations come from bookings with no deposit, although almost all "Non Refund" bookings were cancelled

In [29]:
pd.crosstab([df["cancelation"]], df["deposit_type"],margins = True).style.background_gradient(cmap = "Oranges")
Out[29]:
deposit_type No Deposit Non Refund Refundable All
cancelation
False 50421 59 94 50574
True 20126 9811 25 29962
All 70547 9870 119 80536

Catplot showing cancelled and non-cancelled bookings by deposit type:

In [30]:
sns.catplot(x="deposit_type", col = 'cancelation', data=df, kind = 'count', palette='rainbow')
Out[30]:
<seaborn.axisgrid.FacetGrid at 0x16fc4cb5fa0>

Percentage of cancellation each year

In [31]:
average_of_cancellation = df.groupby(['order_year'])['cancelation'].mean()
print("Cancellation percentage per year:", average_of_cancellation*100)
Cancellation percentage per year: order_year
2015    37.249423
2016    35.860333
2017    38.663789
Name: cancelation, dtype: float64

Barplot showing the average cancellation percentage by year:

In [32]:
cancel_year = go.Figure(data=[go.Bar(x=average_of_cancellation.index, y=average_of_cancellation*100, text=average_of_cancellation*100,
            textposition='outside', marker_color='grey')])
cancel_year.show()

Density Curve of Time until Order by Cancelation

In [33]:
(sns.FacetGrid(df, hue = 'cancelation', height = 6, xlim = (0,500)).map(sns.kdeplot, 'time_until_order', shade = True)
    .add_legend());
#the peak of cancellations is close to 50 days before the order; people who order on short notice tend not to cancel

Feature Histogram

In [34]:
df_num = df.select_dtypes(include=np.number)
df_num.hist(figsize=(15,15))
plt.show()
#anon_feat_10 looks approximately normalized, but the rest of the features do not
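Since `MinMaxScaler` was imported above, the non-normalized features could be scaled onto [0, 1] before modeling; a minimal sketch on toy data (column names reused from the dataset, values hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy numeric frame (column names from the dataset; values hypothetical).
toy = pd.DataFrame({'adr': [50.0, 100.0, 150.0],
                    'time_until_order': [0.0, 200.0, 400.0]})

# Fit the scaler and map every column linearly onto [0, 1].
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)
```

In a real pipeline the scaler would be fitted on the training split only and then applied to the test split, to avoid leaking test-set statistics.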

Preprocessing

Preprocessing is a preliminary processing of the data in order to prepare it for the primary processing or for further analysis.

Correlation between features before handling the data

In [35]:
#The explanation of the correlation is given in the other correlation section
## Correlation before handling the null values
features = df.columns ## Fetching all feature column names
## Applying Pearson correlation
mask = np.zeros_like(df[features].corr(), dtype=bool)
mask[np.triu_indices_from(mask)] = True
## Creating a plot diagram
f, ax = plt.subplots(figsize=(16, 12))
## Title of the plot
plt.title('Pearson Correlation Matrix before null handling', fontsize=27)
sns.heatmap(df[features].corr(), linewidths=0.25, vmax=0.7, square=True, cmap="RdGy",
            linecolor='w', annot=True, annot_kws={"size":8}, mask=mask, cbar_kws={"shrink": .9});

Another Way for Checking Correlation

In [36]:
df.corr().style.background_gradient(cmap='coolwarm')
#cells with a darker colour indicate a stronger correlation
Out[36]:
Unnamed: 0 time_until_order order_year order_day_of_month adults children babies prev_canceled prev_not_canceled changes agent company adr anon_feat_0 anon_feat_1 anon_feat_2 anon_feat_3 anon_feat_4 anon_feat_5 anon_feat_6 anon_feat_7 anon_feat_8 anon_feat_9 anon_feat_10 anon_feat_11 anon_feat_12 anon_feat_13 cancelation guest
Unnamed: 0 1.000000 0.008234 0.299310 0.010057 -0.005796 -0.023063 -0.028345 -0.018364 -0.000844 -0.004441 -0.615504 -0.208787 0.020074 -0.149813 -0.205640 0.041075 -0.019378 -0.157052 -0.195246 -0.013731 -0.133586 0.106112 -0.818237 -0.011636 0.004953 -0.001602 -0.255275 -0.242871 -0.021291
time_until_order 0.008234 1.000000 0.039087 0.003031 0.123699 -0.039179 -0.021315 0.084876 -0.074639 0.000295 -0.071875 0.156485 0.012048 0.087673 0.165793 -0.002684 -0.125870 -0.104356 -0.170992 0.172199 -0.115189 -0.094990 -0.078964 -0.487098 0.959980 -0.005159 0.291412 0.294502 0.072419
order_year 0.299310 0.039087 1.000000 0.001116 0.034312 0.051760 -0.012514 -0.119836 0.029331 0.030669 0.062462 0.255305 0.024363 0.019992 0.029429 0.065655 0.009743 0.092226 0.035632 -0.056072 -0.014540 0.108936 -0.034409 -0.039040 0.050678 -0.002689 0.014805 0.014949 0.054598
order_day_of_month 0.010057 0.003031 0.001116 1.000000 -0.001731 0.015827 0.001337 -0.024917 0.001862 0.010334 -0.000040 0.050604 0.043146 -0.015037 -0.028578 -0.005779 -0.006582 0.015121 0.009192 0.021225 0.006901 0.003584 0.000472 -0.010269 0.007953 -0.001133 0.008743 -0.006737 0.007684
adults -0.005796 0.123699 0.034312 -0.001731 1.000000 0.032568 0.020909 -0.007315 -0.111358 -0.049944 -0.035456 0.199110 0.125008 0.095898 0.091632 0.024537 -0.151878 0.216493 0.147987 -0.009634 0.019174 0.130470 0.010017 -0.124635 0.151967 0.003792 0.053173 0.058103 0.815968
children -0.023063 -0.039179 0.051760 0.015827 0.032568 1.000000 0.025264 -0.024775 -0.021466 0.050719 0.042756 0.030002 0.154710 0.047129 0.044691 -0.050536 -0.033760 0.377861 0.329290 -0.033714 0.056907 0.080463 0.046501 -0.002811 -0.028192 -0.001040 0.030702 0.005691 0.588676
babies -0.028345 -0.021315 -0.012514 0.001337 0.020909 0.025264 1.000000 -0.007703 -0.007084 0.081313 0.037577 0.009572 0.016577 0.017176 0.019489 0.003212 -0.009158 0.041805 0.043314 -0.010851 0.034375 0.099443 0.043540 0.009081 -0.020595 0.007887 -0.030580 -0.032331 0.164627
prev_canceled -0.018364 0.084876 -0.119836 -0.024917 -0.007315 -0.024775 -0.007703 1.000000 0.147278 -0.027251 -0.012652 -0.187248 -0.031825 -0.015218 -0.015059 -0.003083 0.081758 -0.049984 -0.059345 0.006438 -0.018536 -0.048199 0.013700 -0.033175 0.076986 0.007404 0.108098 0.109633 -0.020703
prev_not_canceled -0.000844 -0.074639 0.029331 0.001862 -0.111358 -0.021466 -0.007084 0.147278 1.000000 0.009769 0.020546 -0.212025 -0.040119 -0.043565 -0.049105 -0.040490 0.419006 -0.021786 0.003749 -0.009848 0.042869 0.038668 0.005575 0.078774 -0.096237 0.005830 -0.067665 -0.060068 -0.101481
changes -0.004441 0.000295 0.030669 0.010334 -0.049944 0.050719 0.081313 -0.027251 0.009769 1.000000 0.066958 0.129196 0.042877 0.055910 0.098425 0.025598 0.009176 0.045317 0.096657 -0.012623 0.063982 0.054734 0.070476 -0.005321 0.006169 -0.002519 -0.141534 -0.144559 -0.000553
agent -0.615504 -0.071875 0.062462 -0.000040 -0.035456 0.042756 0.037577 -0.012652 0.020546 0.066958 1.000000 0.348193 0.018900 0.144238 0.182138 -0.050271 0.029123 0.211083 0.238588 -0.055327 0.178723 0.031228 0.790402 0.049566 -0.073886 -0.002113 -0.094053 -0.081911 0.006045
company -0.208787 0.156485 0.255305 0.050604 0.199110 0.030002 0.009572 -0.187248 -0.212025 0.129196 0.348193 1.000000 0.072142 0.065422 0.186355 0.119467 -0.241961 0.034927 0.092052 -0.008472 -0.008917 -0.107560 0.355493 -0.149590 0.194627 0.002753 0.023725 -0.012482 0.189886
adr 0.020074 0.012048 0.024363 0.043146 0.125008 0.154710 0.016577 -0.031825 -0.040119 0.042877 0.018900 0.072142 1.000000 0.216139 0.243888 0.031311 -0.069512 0.185827 0.126162 -0.033824 0.018265 0.136532 0.019577 -0.062123 0.042914 -0.005026 -0.085282 -0.070436 0.188005
anon_feat_0 -0.149813 0.087673 0.019992 -0.015037 0.095898 0.047129 0.017176 -0.015218 -0.043565 0.055910 0.144238 0.065422 0.216139 1.000000 0.501159 0.045912 -0.088181 0.142874 0.086890 -0.054308 -0.015567 0.075290 0.188555 -0.131908 0.136377 0.001395 0.014841 -0.000038 0.104353
anon_feat_1 -0.205640 0.165793 0.029429 -0.028578 0.091632 0.044691 0.019489 -0.015059 -0.049105 0.098425 0.182138 0.186355 0.243888 0.501159 1.000000 0.037282 -0.098017 0.168091 0.101210 -0.001089 -0.024434 0.067968 0.233295 -0.182914 0.221843 0.003176 0.037926 0.026997 0.100461
anon_feat_2 0.041075 -0.002684 0.065655 -0.005779 0.024537 -0.050536 0.003212 -0.003083 -0.040490 0.025598 -0.050271 0.119467 0.031311 0.045912 0.037282 1.000000 -0.057257 -0.120749 -0.120068 -0.008728 -0.038884 0.024830 -0.010596 -0.021945 0.012804 -0.001637 -0.021949 -0.016972 -0.008321
anon_feat_3 -0.019378 -0.125870 0.009743 -0.006582 -0.151878 -0.033760 -0.009158 0.081758 0.419006 0.009176 0.029123 -0.241961 -0.069512 -0.088181 -0.098017 -0.057257 1.000000 -0.031990 0.030655 -0.022720 0.069486 0.009119 0.051930 0.162063 -0.171925 0.003590 -0.082977 -0.085612 -0.140845
anon_feat_4 -0.157052 -0.104356 0.092226 0.015121 0.216493 0.377861 0.041805 -0.049984 -0.021786 0.045317 0.211083 0.034927 0.185827 0.142874 0.168091 -0.120749 -0.031990 1.000000 0.814067 -0.069679 0.137030 0.138020 0.250781 0.016870 -0.082272 0.000712 -0.063609 -0.060772 0.389078
anon_feat_5 -0.195246 -0.170992 0.035632 0.009192 0.147987 0.329290 0.043314 -0.059345 0.003749 0.096657 0.238588 0.092052 0.126162 0.086890 0.101210 -0.120068 0.030655 0.814067 1.000000 -0.070303 0.164445 0.126637 0.307804 0.085888 -0.164706 0.000512 -0.164383 -0.176803 0.307157
anon_feat_6 -0.013731 0.172199 -0.056072 0.021225 -0.009634 -0.033714 -0.010851 0.006438 -0.009848 -0.012623 -0.055327 -0.008472 -0.033824 -0.054308 -0.001089 -0.008728 -0.022720 -0.069679 -0.070303 1.000000 -0.033073 -0.083733 -0.073289 -0.077078 0.159428 -0.000483 0.085877 0.058219 -0.027910
anon_feat_7 -0.133586 -0.115189 -0.014540 0.006901 0.019174 0.056907 0.034375 -0.018536 0.042869 0.063982 0.178723 -0.008917 0.018265 -0.015567 -0.024434 -0.038884 0.069486 0.137030 0.164445 -0.033073 1.000000 0.081367 0.221654 0.100472 -0.135518 0.005027 -0.191681 -0.197533 0.052284
anon_feat_8 0.106112 -0.094990 0.108936 0.003584 0.130470 0.080463 0.099443 -0.048199 0.038668 0.054734 0.031228 -0.107560 0.136532 0.075290 0.067968 0.024830 0.009119 0.138020 0.126637 -0.083733 0.081367 1.000000 0.041421 0.010021 -0.077557 0.004910 -0.218565 -0.233985 0.162010
anon_feat_9 -0.818237 -0.078964 -0.034409 0.000472 0.010017 0.046501 0.043540 0.013700 0.005575 0.070476 0.790402 0.355493 0.019577 0.188555 0.233295 -0.010596 0.051930 0.250781 0.307804 -0.073289 0.221654 0.041421 1.000000 0.071100 -0.083305 0.002088 -0.125310 -0.136527 0.039907
anon_feat_10 -0.011636 -0.487098 -0.039040 -0.010269 -0.124635 -0.002811 0.009081 -0.033175 0.078774 -0.005321 0.049566 -0.149590 -0.062123 -0.131908 -0.182914 -0.021945 0.162063 0.016870 0.085888 -0.077078 0.100472 0.010021 0.071100 1.000000 -0.581297 -0.000731 -0.203804 -0.204077 -0.099658
anon_feat_11 0.004953 0.959980 0.050678 0.007953 0.151967 -0.028192 -0.020595 0.076986 -0.096237 0.006169 -0.073886 0.194627 0.042914 0.136377 0.221843 0.012804 -0.171925 -0.082272 -0.164706 0.159428 -0.135518 -0.077557 -0.083305 -0.581297 1.000000 -0.003582 0.308467 0.313919 0.102087
anon_feat_12 -0.001602 -0.005159 -0.002689 -0.001133 0.003792 -0.001040 0.007887 0.007404 0.005830 -0.002519 -0.002113 0.002753 -0.005026 0.001395 0.003176 -0.001637 0.003590 0.000712 0.000512 -0.000483 0.005027 0.004910 0.002088 -0.000731 -0.003582 1.000000 -0.029618 -0.004417 0.003469
anon_feat_13 -0.255275 0.291412 0.014805 0.008743 0.053173 0.030702 -0.030580 0.108098 -0.067665 -0.141534 -0.094053 0.023725 -0.085282 0.014841 0.037926 -0.021949 -0.082977 -0.063609 -0.164383 0.085877 -0.191681 -0.218565 -0.125310 -0.203804 0.308467 -0.029618 1.000000 1.000000 0.055176
cancelation -0.242871 0.294502 0.014949 -0.006737 0.058103 0.005691 -0.032331 0.109633 -0.060068 -0.144559 -0.081911 -0.012482 -0.070436 -0.000038 0.026997 -0.016972 -0.085612 -0.060772 -0.176803 0.058219 -0.197533 -0.233985 -0.136527 -0.204077 0.313919 -0.004417 1.000000 1.000000 0.045016
guest -0.021291 0.072419 0.054598 0.007684 0.815968 0.588676 0.164627 -0.020703 -0.101481 -0.000553 0.006045 0.189886 0.188005 0.104353 0.100461 -0.008321 -0.140845 0.389078 0.307157 -0.027910 0.052284 0.162010 0.039907 -0.099658 0.102087 0.003469 0.055176 0.045016 1.000000

Changing bool into numeric

Changing the type of the "cancelation" column from bool to int by mapping "True" to 1 and "False" to 0:

In [37]:
df['cancelation'].replace({True:1,False:0},inplace=True)

Handling null values

In [38]:
df.isnull().sum()   #checking how many null values for every feature
Out[38]:
Unnamed: 0                 0
time_until_order       12681
order_year                 0
order_month             3434
order_week                 0
order_day_of_month         0
adults                     0
children                   4
babies                     0
country                 4341
order_type                 0
acquisition_channel        0
prev_canceled              0
prev_not_canceled          0
changes                 3477
deposit_type            9006
agent                  12196
company                84480
customer_type           9895
adr                     2983
anon_feat_0             3381
anon_feat_1                0
anon_feat_2                0
anon_feat_3                0
anon_feat_4                0
anon_feat_5             4032
anon_feat_6             4233
anon_feat_7             4248
anon_feat_8                0
anon_feat_9             3731
anon_feat_10            2732
anon_feat_11            4957
anon_feat_12               0
anon_feat_13           83766
cancelation                0
guest                      4
dtype: int64

Handling null values using interpolation:
Interpolation can handle both object and numeric values easily.
(Interpolation is explained in the Report)

Two interpolation methods will be used:

  • Interpolation through padding (copying the value just before a missing entry)
  • Linear interpolation
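On a toy Series (the values here are illustrative, not from the dataset), the two methods behave as follows: linear interpolation fills gaps along a straight line between known values, while padding repeats the last observed value up to the given limit:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

linear = s.interpolate()                       # straight line between known values
padded = s.interpolate(method='pad', limit=2)  # repeat the last observed value

print(linear.tolist())  # [1.0, 2.0, 3.0, 4.0, 4.0]
print(padded.tolist())  # [1.0, 1.0, 1.0, 4.0, 4.0]
```

Note that neither method can fill a NaN that appears before the first valid value, which is one reason a second pass (and a final dropna) is still needed.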
In [39]:
## Linear interpolation will be applied to handle the data linearly
df = df.interpolate() 
In [40]:
## Padding interpolation is applied to handle the values missed by linear interpolation.
# For padding interpolation we specify a limit: the maximum number of NaNs the method can fill consecutively.
df = df.interpolate(method='pad', limit=15) # 'pad': fill in NaNs using existing values. 

Handling infinite values

In [41]:
df = df.replace([np.inf, -np.inf], np.nan) ## convert infinite values into NaN values
df = df.dropna(how="any") #drop all rows that still contain NaN values
In [42]:
df.isnull().sum() #check that no null values remain in the data
Out[42]:
Unnamed: 0             0
time_until_order       0
order_year             0
order_month            0
order_week             0
order_day_of_month     0
adults                 0
children               0
babies                 0
country                0
order_type             0
acquisition_channel    0
prev_canceled          0
prev_not_canceled      0
changes                0
deposit_type           0
agent                  0
company                0
customer_type          0
adr                    0
anon_feat_0            0
anon_feat_1            0
anon_feat_2            0
anon_feat_3            0
anon_feat_4            0
anon_feat_5            0
anon_feat_6            0
anon_feat_7            0
anon_feat_8            0
anon_feat_9            0
anon_feat_10           0
anon_feat_11           0
anon_feat_12           0
anon_feat_13           0
cancelation            0
guest                  0
dtype: int64

Convert to Categorical Variables

We will use pandas category codes (cat.codes) to convert them easily:
(also explained in the Report)
Converting to the categorical type is a process of factorization: each unique value (category) is assigned an incrementing integer starting from zero.
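For example, on a toy Series of country codes (the values are illustrative, not taken from the dataset), cat.codes sorts the unique categories and numbers them from zero:

```python
import pandas as pd

s = pd.Series(['PRT', 'GBR', 'PRT', 'FRA'])   # hypothetical country values
codes = s.astype('category').cat.codes

# Categories are sorted alphabetically and numbered from zero:
# FRA -> 0, GBR -> 1, PRT -> 2
print(codes.tolist())  # [2, 1, 2, 0]
```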

In [43]:
df['country'] =df['country'].astype('category').cat.codes  
df['order_type'] =df['order_type'].astype('category').cat.codes  
df['acquisition_channel'] =df['acquisition_channel'].astype('category').cat.codes  
df['deposit_type'] =df['deposit_type'].astype('category').cat.codes  
df['customer_type'] =df['customer_type'].astype('category').cat.codes  
df['order_week'] =df['order_week'].astype('category').cat.codes  

Handling Date Column

Combining the date columns into one date column:
pandas' datetime tooling expects columns named year, month and day, so the predefined column names will be changed:

In [44]:
df = df.rename(columns={'order_year': 'year', 'order_month': 'month', 'order_day_of_month': 'day'})
#We rename the columns so they can be combined into a single date with "to_datetime"

A few rows have invalid date entries: June, for example, has only 30 days, yet there are entries for day 31. We will drop these rows:

In [45]:
result = df.loc[df['month'].isin(['June', 'September', 'November', 'April']) & (df['day'] == 31)].index
df = df.drop(result)
#months that have 30 days

In [46]:
result1 = df.loc[df['month'].isin(['February']) & df['day'].isin([29, 30, 31])].index
#February has at most 28 days in this dataset (leap days are ignored)
df = df.drop(result1)

Now every month contains only valid day numbers.
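An alternative sketch (not the approach used above): pd.to_datetime with errors='coerce' turns any impossible date into NaT, which can then be filtered out; the column names below mirror the renamed ones, and the toy values are illustrative:

```python
import pandas as pd

# Hypothetical frame with one valid and one impossible date
frame = pd.DataFrame({'year': [2016, 2016], 'month': ['June', 'June'], 'day': [30, 31]})

dates = pd.to_datetime(frame.year.astype(str) + '/' + frame.month + '/' + frame.day.astype(str),
                       format='%Y/%B/%d', errors='coerce')
valid = frame[dates.notna()]  # 'June 31' becomes NaT and is filtered out
print(len(valid))  # 1
```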

Creating Date Object:

In [47]:
df['DATE'] = pd.to_datetime(df.year.astype(str) + '/' + df.month.astype(str) + '/' + df.day.astype(str))

The model can't train on raw dates, so each date is moved into the DataFrame index instead:

In [48]:
df=df.set_index(df.DATE)
In [49]:
df = df.drop(['Unnamed: 0','year','month','day','DATE'], axis = 1)

Outliers

Using Z-score to find the outliers
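The z-score measures how many standard deviations a value lies from the column mean. A minimal sketch on a toy array with one extreme value shows how the |z| > 3 rule isolates it (the numbers are illustrative):

```python
import numpy as np
from scipy import stats

data = np.array([10.0] * 19 + [100.0])   # one extreme value among twenty
z = np.abs(stats.zscore(data))

filtered = data[z < 3]   # keep only points within three standard deviations
print(z.max() > 3, filtered.size)  # True 19
```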

In [50]:
df = df.astype(float) #changing values into float
In [51]:
from scipy import stats
outliers = np.abs(stats.zscore(df))
print(outliers)
[[1.00357364 1.50656764 1.51693954 ... 0.91114642 0.76759718 1.36305332]
 [0.29469714 0.23573215 0.25186062 ... 0.91114642 0.76759718 0.04335304]
 [0.98419646 1.28111203 0.25186062 ... 0.91114642 0.76759718 0.04335304]
 ...
 [0.62571871 0.23573215 1.51693954 ... 0.91114642 0.76759718 1.36305332]
 [0.06217103 0.16604016 0.25186062 ... 0.91114642 1.30276665 0.04335304]
 [0.76943795 0.73995573 0.25186062 ... 0.91114642 0.76759718 0.04335304]]
In [52]:
threshold = 3 # values whose absolute z-score is greater than 3 are treated as outliers
print(np.where(outliers > threshold)) #printing the indices where the z-score exceeds the threshold
(array([    0,     0,     6, ..., 89406, 89412, 89412], dtype=int64), array([13, 19, 14, ..., 18,  3, 23], dtype=int64))

Removing Outliers:

In [53]:
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

Checking The New Dataset without Outliers:

In [54]:
df.shape
Out[54]:
(65827, 32)

Correlation between features and Removing Features

Apply Pearson correlation to find the highly correlated features

The correlation coefficient takes values between -1 and 1:

  • A value closer to 0 implies weaker correlation (exactly 0 implies no correlation)
  • A value closer to 1 implies a stronger positive correlation
  • A value closer to -1 implies a stronger negative correlation
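A quick illustration of the three cases with np.corrcoef on toy arrays (all values are illustrative):

```python
import numpy as np

base = np.array([1.0, 2.0, 3.0, 4.0])

r_pos = np.corrcoef(base, 2 * base + 1)[0, 1]                       # exactly +1
r_neg = np.corrcoef(base, -base)[0, 1]                              # exactly -1
r_weak = np.corrcoef(base, np.array([0.3, -1.2, 0.8, -0.1]))[0, 1]  # close to 0

print(r_pos, r_neg, r_weak)
```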
In [55]:
features = df.columns ## fetching all feature column names
## Applying Pearson correlation
mask = np.zeros_like(df[features].corr(), dtype=bool) 
mask[np.triu_indices_from(mask)] = True 
## Creating a Plot Diagram
f, ax = plt.subplots(figsize=(16, 12))
## Title of Plot
plt.title('Pearson Correlation Matrix',fontsize=27)
sns.heatmap(df[features].corr(),linewidths=0.25,vmax=0.7,square=True,cmap="OrRd", 
linecolor='w',annot=True,annot_kws={"size":8},mask=mask,cbar_kws={"shrink": .9});

Highly correlated variables tend to carry similar information, which can drag down model performance, so highly correlated features will be removed from the model.

In [56]:
corr_matrix = df.corr().abs()
## keep only the upper triangle so each correlated pair is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

Removing features

The following features were removed from the dataset because of their high correlation:

In [57]:
to_drop = [column for column in upper.columns if any(upper[column] > 0.80)]
df = df.drop(to_drop, axis=1)
to_drop
Out[57]:
['anon_feat_11', 'guest']

Separate dataset and labels

In [58]:
#splitting
x=df.drop(['cancelation'], axis = 1)
y=df.cancelation

Dimensonality Reduction and Feature Selection

In machine learning, a model's performance benefits from more features only up to a certain point. The more features are fed into a model, the higher the dimensionality of the data, and as the dimensionality increases, overfitting becomes more likely.
The dimensionality problem usually occurs when there is a large number of features, because they directly affect the model's predictions.
Here the dataset consists of only 35 features and is not high-dimensional; in addition, feature importance ensures that only the important features are used to train the model, so dimensionality reduction is not required for this dataset.

Feature Selection

Since there are a number of anonymized features and we don't know what they actually represent, we will apply a feature importance technique to check the importance of each feature and its effect on model training.

  • The Random Forest feature importance method is used
In [59]:
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(x, y)
Out[59]:
RandomForestClassifier(random_state=0)
In [60]:
feature_scores = pd.Series(clf.feature_importances_, index=x.columns).sort_values(ascending=False)
feature_scores
Out[60]:
country                0.114353
time_until_order       0.101610
deposit_type           0.093515
anon_feat_10           0.075339
adr                    0.073756
company                0.060873
order_week             0.059893
agent                  0.058540
anon_feat_8            0.054488
order_type             0.050623
anon_feat_13           0.037529
prev_canceled          0.032511
anon_feat_1            0.030943
anon_feat_5            0.023831
anon_feat_0            0.020756
customer_type          0.020187
changes                0.017125
anon_feat_4            0.013101
adults                 0.012354
anon_feat_2            0.012251
anon_feat_12           0.011000
anon_feat_9            0.009463
acquisition_channel    0.009131
children               0.003690
anon_feat_6            0.001666
anon_feat_7            0.000873
prev_not_canceled      0.000598
anon_feat_3            0.000000
babies                 0.000000
dtype: float64

Here we can see that babies has the least effect on the model's predictions, while country has the most.
There are also many anonymized features that affect the predictions, so those will be kept for model training.

In [61]:
f, ax = plt.subplots(figsize=(15, 7))
sns.barplot(x=feature_scores, y=feature_scores.index)
ax.set_title("The importance of Features")
ax.set_yticklabels(feature_scores.index)
ax.set_xlabel("Feature importance score")
ax.set_ylabel("Features")
plt.show()

So the top 20 features are selected for the model

In [62]:
## Selecting top 20 features based on the ranking
features = feature_scores.index[0:20]
x = x[features]
x
Out[62]:
country time_until_order deposit_type anon_feat_10 adr company order_week agent anon_feat_8 order_type anon_feat_13 prev_canceled anon_feat_1 anon_feat_5 anon_feat_0 customer_type changes anon_feat_4 adults anon_feat_2
DATE
2015-08-29 124.0 134.0 0.0 0.409388 140.0 506.34375 28.0 240.0 2.0 6.0 0.0 0.0 3.0 4.0 2.0 2.0 0.0 4.0 2.0 0.0
2016-11-27 149.0 2.0 0.0 0.516904 8989.0 500.68750 43.0 9.0 1.0 6.0 0.0 0.0 2.0 0.0 2.0 2.0 0.0 0.0 2.0 0.0
2016-04-13 124.0 19.0 0.0 0.421692 154.0 495.03125 7.0 10.0 0.0 6.0 0.0 0.0 1.0 0.0 0.0 2.0 0.0 0.0 3.0 3.0
2016-03-05 75.0 145.0 0.0 0.326480 803.0 489.37500 1.0 9.0 0.0 6.0 0.0 0.0 1.0 1.0 2.0 3.0 0.0 0.0 2.0 0.0
2016-09-21 75.0 271.0 0.0 0.357281 10133.0 483.71875 32.0 12.0 1.0 5.0 0.0 0.0 3.0 0.0 0.0 3.0 0.0 0.0 2.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2017-05-21 27.0 21.0 0.0 0.321296 1575.0 342.00000 13.0 8.0 1.0 6.0 0.0 0.0 2.0 0.0 2.0 3.0 0.0 0.0 2.0 0.0
2016-04-14 40.0 90.0 0.0 0.398224 9095.0 342.00000 7.0 9.0 1.0 6.0 0.0 0.0 3.0 0.0 2.5 2.0 0.0 0.0 2.0 0.0
2015-08-23 124.0 39.0 0.0 0.387626 1296.0 342.00000 28.0 240.0 1.0 6.0 0.0 0.0 5.0 0.0 3.0 2.0 0.0 0.0 1.0 0.0
2015-08-17 124.0 110.0 0.0 0.323147 134.0 342.00000 27.0 240.0 1.0 6.0 0.0 0.0 5.0 0.0 2.0 2.0 0.0 0.0 2.0 0.0
2017-05-28 124.0 183.0 0.0 0.286674 1395.0 342.00000 14.0 9.0 1.0 6.0 0.0 0.0 0.0 3.0 2.0 3.0 0.0 3.0 2.0 0.0

65827 rows × 20 columns

Scale the values (Normalization)

Feature scaling (standardization) is a data pre-processing step applied to the independent variables (features) of the data. It helps normalize the data to a particular range and can also speed up the calculations in some algorithms. Normalization is therefore important so that all features are scaled equally, which gives better results.
Looking at the data, we can see that it is not normalized. For scaling we are using min-max scaling.
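Min-max scaling maps each feature to [0, 1] via x' = (x − min) / (max − min). A minimal sketch on a single toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

col = np.array([[10.0], [20.0], [40.0]])   # one toy feature column
scaled = MinMaxScaler().fit_transform(col)

# x' = (x - min) / (max - min): 10 -> 0.0, 20 -> 10/30 ≈ 0.333, 40 -> 1.0
print(scaled.ravel())
```

Note that the minimum maps to 0 and the maximum to 1 regardless of the original range, which is what puts all features on an equal footing.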

In [63]:
# Get column names first
names = x.columns
# Create the Scaler object
sc = MinMaxScaler()
# Fit your data on the scaler object
x = sc.fit_transform(x)
x = pd.DataFrame(x, columns=names)

Splitting train data into train and test data

In [64]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.35, random_state = 47)
print ("X_train: ", len(X_train))
print("X_test: ", len(X_test))
print("y_train: ", len(y_train))
print("y_test: ", len(y_test))
X_train:  42787
X_test:  23040
y_train:  42787
y_test:  23040

Preprocessing Test Data

In [65]:
test_df = pd.read_csv('feature_data_test.csv') #loading the data

Creating a function to preprocess the test data:

In [66]:
def test_preprocess(df,features):
  df = df.interpolate() 
  df = df.interpolate(method='pad', limit=15) 
  df = df.replace([np.inf, -np.inf], np.nan) ## convert infinite values into NaN values
  df = df.dropna(how="any")
  df['country'] =df['country'].astype('category').cat.codes  
  df['order_type'] =df['order_type'].astype('category').cat.codes  
  df['acquisition_channel'] =df['acquisition_channel'].astype('category').cat.codes  
  df['deposit_type'] =df['deposit_type'].astype('category').cat.codes  
  df['customer_type'] =df['customer_type'].astype('category').cat.codes  
  df['order_week'] =df['order_week'].astype('category').cat.codes  
  df = df[features]
  return df
In [67]:
test_df = test_preprocess(test_df,features)

Models

The method follows three steps:

  • Training the model on the training dataset, which is split into train and test sets
  • Predicting on the real test set and adding the results to the final prediction file
  • Model evaluation, done through:
    • Confusion matrix (the numbers change on every run but stay in the same range)
    • K-fold cross validation and ROC/AUC

These are the functions that we call for both the simple and the advanced models:

In [68]:
def models(alg, X_train, X_test, y_train, y_test,test_df):
    model = alg
    model_alg = model.fit(X_train, y_train)
    global y_probability, y_pred, test_prob #global so these variables remain available after the function returns
    y_probability = model_alg.predict_proba(X_test)[:,1] #predicted probability of the positive label
    y_pred = model_alg.predict(X_test) #predicting label
    test_prob = model_alg.predict_proba(test_df)[:,1] #testing the real test
    train_pred = model_alg.predict(X_train)
    name = type(model).__name__
    nn_cm = confusion_matrix(y_test, y_pred) # Creating the confusion matrix
    # Visualization:
    f, ax = plt.subplots(figsize=(5,5))
    sns.heatmap(nn_cm, annot=True, linewidth=0.7, linecolor='olive', fmt='.0f', ax=ax, cmap='YlGnBu')
    plt.title(name)
    plt.xlabel('y_pred')
    plt.ylabel('y_test')
    plt.show()        
    
def Check(model): #making AUC for every model, the final AUC is the mean of k folds
    tprs = []
    aucs = []
    mean_fpr = np.linspace(0,1,100)
    fig1 = plt.figure(figsize=[12,12])
    cv = KFold(n_splits=5, random_state=7, shuffle=True)
    i = 1
    for train_index, test_index in cv.split(x): #checking for every k fold 
        #iloc : helps us select a value that belongs to a particular row or column
        X_train = x.iloc[train_index]
        X_test = x.iloc[test_index]
        y_train = y.iloc[train_index]
        y_test = y.iloc[test_index]
        model.fit(X_train, y_train)  # Run Models
        prediction = model.predict_proba(X_test)
        fpr, tpr, t = roc_curve(y_test, prediction[:, 1]) #making ROC 
        tprs.append(np.interp(mean_fpr, fpr, tpr))
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)
        print("Training Data Accuracy:", model.score(X_train,y_train)*100) 
        print("Test Data Accuracy:", model.score(X_test,y_test)*100) 
        plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
        i= i+1

    plt.plot([0,1],[0,1],linestyle = '--',lw = 2,color = 'black')
    mean_tpr = np.mean(tprs, axis=0)
    mean_auc = auc(mean_fpr, mean_tpr)
    plt.plot(mean_fpr, mean_tpr, color='blue',
             label=r'Mean ROC (AUC = %0.2f )' % (mean_auc),lw=2, alpha=1)

    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC')
    plt.legend(loc="lower right")
    plt.text(0.32,0.7,'More accurate area',fontsize = 12)
    plt.text(0.63,0.4,'Less accurate area',fontsize = 12)
    plt.show()

KNN

Parameters:

  • algorithm='auto'
  • leaf_size=30
  • metric='minkowski'
  • metric_params=None
  • n_jobs=None
  • n_neighbors=3
  • p=2
  • weights='uniform'
In [69]:
knn = KNeighborsClassifier(n_neighbors=3)
Check(knn)
Training Data Accuracy: 87.61322420766791
Test Data Accuracy: 77.0621297280875
Training Data Accuracy: 87.61322420766791
Test Data Accuracy: 77.0545344068054
Training Data Accuracy: 87.76347271277201
Test Data Accuracy: 76.82491454614508
Training Data Accuracy: 87.74448368842809
Test Data Accuracy: 77.36422331940751
Training Data Accuracy: 87.6875166153963
Test Data Accuracy: 76.78693505507026

In This Model:

  • TP: 6540 booking cancelations were correctly classified as cancelations by the model.
  • TN: 11118 booking non-cancelations were correctly classified as non-cancelations by the model.
  • FP: 2518 booking non-cancelations were incorrectly classified as cancelations by the model.
  • FN: 2864 booking cancelations were incorrectly classified as non-cancelations by the model.
In [70]:
models(KNeighborsClassifier(n_neighbors=3), X_train, X_test, y_train, y_test,test_df)
In [71]:
test_pred = pd.DataFrame() #making a DataFrame for the test predictions
In [72]:
test_pred['KNN Prediction'] = test_prob #adding the predictions for this specific algorithm

Logistic Regression

Parameters:

  • C=1.0,
  • class_weight=None,
  • dual=False,
  • fit_intercept=True,
  • intercept_scaling=1
  • l1_ratio=None,
  • max_iter=100,
  • multi_class='auto'
  • n_jobs=None, penalty='l2'
  • random_state=None
  • solver='lbfgs'
  • tol=0.0001
  • verbose=0
  • warm_start=False
In [73]:
lr = LogisticRegression()
Check(lr)
Training Data Accuracy: 76.5196255293291
Test Data Accuracy: 76.99377183654869
Training Data Accuracy: 76.60697670002469
Test Data Accuracy: 76.80388880449644
Training Data Accuracy: 76.60172420341043
Test Data Accuracy: 76.57424990505127
Training Data Accuracy: 76.68147810565493
Test Data Accuracy: 76.2932016710976
Training Data Accuracy: 76.70426493486765
Test Data Accuracy: 76.2324344853779

In This Model:

  • TP: 5589 booking cancelations were correctly classified as cancelations by the model.
  • TN: 12084 booking non-cancelations were correctly classified as non-cancelations by the model.
  • FP: 1552 booking non-cancelations were incorrectly classified as cancelations by the model.
  • FN: 3815 booking cancelations were incorrectly classified as non-cancelations by the model.
In [74]:
models(LogisticRegression(), X_train, X_test, y_train, y_test,test_df)
In [75]:
test_pred['LR Prediction'] = test_prob #adding the predictions for this specific algorithm

Advanced Models

Random Forest

Parameters:

bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False

In [76]:
rf = RandomForestClassifier()
Check(rf)
Training Data Accuracy: 100.0
Test Data Accuracy: 84.39161476530457
Training Data Accuracy: 99.99810106150662
Test Data Accuracy: 84.62706972504937
Training Data Accuracy: 100.0
Test Data Accuracy: 84.73984048613748
Training Data Accuracy: 99.99810109756561
Test Data Accuracy: 84.8841625522218
Training Data Accuracy: 100.0
Test Data Accuracy: 84.08659323965058

In This Model:

  • TP: 7027 booking cancelations were correctly classified as cancelations by the model.
  • TN: 12427 booking non-cancelations were correctly classified as non-cancelations by the model.
  • FP: 1209 booking non-cancelations were incorrectly classified as cancelations by the model.
  • FN: 2377 booking cancelations were incorrectly classified as non-cancelations by the model.
In [77]:
models(RandomForestClassifier(), X_train, X_test, y_train, y_test,test_df)
In [78]:
test_pred['RF Prediction'] = test_prob #adding the predictions for this specific algorithm

MLP Classifier

Parameters:

activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08, hidden_layer_sizes=(100,), learning_rate='constant', learning_rate_init=0.001, max_fun=15000, max_iter=200, momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5, random_state=None, shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1, verbose=False, warm_start=False

In [79]:
mlp = MLPClassifier()
Check(mlp)
C:\Users\user\Anaconda3\Anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:582: ConvergenceWarning:

Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.

Training Data Accuracy: 82.65889367843376
Test Data Accuracy: 80.93574358195352
Training Data Accuracy: 81.2992537171721
Test Data Accuracy: 80.35849916451467
Training Data Accuracy: 82.50541187193802
Test Data Accuracy: 81.12419293581466
Training Data Accuracy: 82.19399187269758
Test Data Accuracy: 81.46600835548804
Training Data Accuracy: 82.68390870077096
Test Data Accuracy: 81.32168628940373

In This Model:

  • TP: 7075 booking cancelations were correctly classified as cancelations by the model.
  • TN: 11662 booking non-cancelations were correctly classified as non-cancelations by the model.
  • FP: 1974 booking non-cancelations were incorrectly classified as cancelations by the model.
  • FN: 2329 booking cancelations were incorrectly classified as non-cancelations by the model.
In [80]:
models(MLPClassifier(), X_train, X_test, y_train, y_test,test_df)
C:\Users\user\Anaconda3\Anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:582: ConvergenceWarning:

Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.

In [81]:
test_pred['MLP Prediction'] = test_prob #adding the predictions for this specific algorithm

Model Evaluation

The confusion matrix and the k-fold cross validation results are included in the Models/Advanced Models part.
If we look at the accuracy difference between training and test in every k-fold, we can see that, although there is a little bias toward the training data, overall the models (MLP, SVC, Logistic Regression, etc.) are not overfitting.

Confusion Matrix

The confusion matrix is used to evaluate how much the model predicted correctly. It compares the true labels with the predicted labels and summarizes the result in four counts:

  • True Positive: how many samples the model predicted as positive that are actually positive.
  • True Negative: how many samples the model predicted as negative that are actually negative.
  • False Positive: how many samples the model predicted as positive that are actually negative.
  • False Negative: how many samples the model predicted as negative that are actually positive.
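A minimal sketch with sklearn's confusion_matrix on toy labels (1 = cancelation, 0 = no cancelation; the values are illustrative), showing how the four counts are read off:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# sklearn lays the matrix out as [[TN, FP], [FN, TP]] for labels (0, 1)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```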

K-Fold Cross Validation and ROC curve/AUC

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.
Cross-validation procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation.
In addition to this method we use ROC curve/AUC in order to evaluate the model quality.
AUC stands for "Area Under the ROC Curve": it measures the entire two-dimensional area underneath the ROC curve.
AUC provides an aggregate measure of performance across all possible classification thresholds, so we are looking for a higher AUC, which indicates a better model.
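A minimal sketch of the AUC computation with sklearn's roc_auc_score on toy scores (this example mirrors the scikit-learn documentation):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # predicted probability of the positive class

auc_value = roc_auc_score(y_true, scores)
print(auc_value)  # 0.75
```

In the Check function above the same quantity is computed per fold via roc_curve and auc, and the mean over the five folds gives the reported mean AUC.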

Under-fitting and Over-fitting

Under-fitting and over-fitting are two common problems in machine learning; a model that underperforms usually does so for one of these reasons.
Under-fitting happens when the model is too simple, i.e. it is trained on too few features or is regularized so heavily that it cannot learn anything from the dataset, which leads to low variance but high bias and wrong predictions. Over-fitting, on the other hand, occurs when a model fits the training data so closely that it eventually fails to give good predictions on unseen data (the test set).
Neither issue has a fixed solution, but both can be mitigated in a number of ways, which we implemented in our model:

  • Cross-Validation: cross-validation (k-fold) is one of the most common ways to estimate the out-of-sample prediction error, which helps prevent over-fitting.
  • Under-fitting can generally be reduced by building a more complex model and using better features, which we did by creating an ensemble model and using Pearson correlation to select better features for training.
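A toy illustration of these two failure modes, on synthetic data (the depth-1 stump and the unconstrained tree below are illustrative, not the project's models):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Under-fit candidate: a single split cannot capture 5 informative features
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)
# Over-fit candidate: an unconstrained tree memorizes the training split
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)

print("stump  train %.2f  test %.2f" % (stump.score(X_tr, y_tr), stump.score(X_te, y_te)))
print("deep   train %.2f  test %.2f" % (deep.score(X_tr, y_tr), deep.score(X_te, y_te)))
```

The stump scores poorly on both splits (high bias), while the unconstrained tree reaches perfect training accuracy but a noticeably lower test score, which is the train/test gap that cross-validation helps detect.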

Prediction

In [82]:
test_pred
Out[82]:
KNN Prediction LR Prediction RF Prediction MLP Prediction
0 0.333333 1.000000e+00 0.56 1.0
1 0.333333 0.000000e+00 0.40 1.0
2 0.666667 1.000000e+00 0.52 1.0
3 0.333333 1.000000e+00 0.56 1.0
4 0.333333 1.000000e+00 0.43 1.0
... ... ... ... ...
29825 0.333333 0.000000e+00 0.43 1.0
29826 0.333333 1.000000e+00 0.71 1.0
29827 0.666667 5.164296e-224 0.48 1.0
29828 0.666667 0.000000e+00 0.46 1.0
29829 0.666667 1.000000e+00 0.50 1.0

29830 rows × 4 columns

We chose the "Random Forest" model because it gave us the highest AUC (0.92), which indicates the best predictions.
Choosing the Random Forest predictions to submit in our output file:

In [83]:
test_pred['RF Prediction']
Out[83]:
0        0.56
1        0.40
2        0.52
3        0.56
4        0.43
         ... 
29825    0.43
29826    0.71
29827    0.48
29828    0.46
29829    0.50
Name: RF Prediction, Length: 29830, dtype: float64
In [84]:
test_pred['RF Prediction'].to_csv("submission_group_12.csv")
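The selection step above (keep whichever candidate scores the highest AUC on held-out data) can be sketched as follows; the candidate models and synthetic data here are illustrative, not the project's actual objects:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=15, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=1)

candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(random_state=1),
}
# Score each candidate's predicted probabilities on the validation split
aucs = {name: roc_auc_score(y_val, m.fit(X_tr, y_tr).predict_proba(X_val)[:, 1])
        for name, m in candidates.items()}
best = max(aucs, key=aucs.get)  # model with the highest validation AUC
print(best, aucs)
```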

Other Models

We tried these models while building the project; they were excluded because of their low AUC.
Their code cells are kept in Markdown, so they are not part of the workflow.

Naive Bayes

Parameters:
priors=None ,var_smoothing=1e-09

Code:
nb = GaussianNB()
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
i = 1
fig1 = plt.figure(figsize=[12, 12])
cv = KFold(n_splits=5, random_state=7, shuffle=True)
for train_index, test_index in cv.split(x):
    X_train = x.iloc[train_index]
    X_test = x.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test = y.iloc[test_index]
    nb.fit(X_train, y_train)  # Fit the model on this fold
    prediction = nb.predict_proba(X_test)
    fpr, tpr, t = roc_curve(y_test, prediction[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    print("Training Data Accuracy:", nb.score(X_train, y_train) * 100)
    print("Test Data Accuracy:", nb.score(X_test, y_test) * 100)
    plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i = i + 1
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='black')
mean_tpr = np.mean(tprs, axis=0)
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, color='blue',
         label=r'Mean ROC (AUC = %0.2f)' % mean_auc, lw=2, alpha=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.text(0.32, 0.7, 'More accurate area', fontsize=12)
plt.text(0.63, 0.4, 'Less accurate area', fontsize=12)
plt.show()

naive%20bayers.png

In This Model:

  • TP: 1888 booking cancellations were correctly classified as cancellations by the model.
  • TN: 7646 booking non-cancellations were correctly classified as non-cancellations by the model.
  • FP: 98 booking non-cancellations were incorrectly classified as cancellations by the model.
  • FN: 3533 booking cancellations were incorrectly classified as non-cancellations by the model.

Code:
models(GaussianNB(), X_train, X_test, y_train, y_test,test_df)

naive%20bayers2.png

Decision Tree Classifier

Parameters:

ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=None, splitter='best'

Code:
dt = DecisionTreeClassifier()
scores = []
y_preds = []
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
i = 1
fig1 = plt.figure(figsize=[12, 12])

cv = KFold(n_splits=3, random_state=47, shuffle=True)
for train_index, test_index in cv.split(x):
    X_train = x.iloc[train_index]
    X_test = x.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test = y.iloc[test_index]
    dt.fit(X_train, y_train)  # Fit the model on this fold
    prediction = dt.predict_proba(X_test)
    fpr, tpr, t = roc_curve(y_test, prediction[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    scores.append(dt.score(X_test, y_test))
    print("Training Data Accuracy:", dt.score(X_train, y_train) * 100)
    print("Test Data Accuracy:", dt.score(X_test, y_test) * 100)
    y_preds.append(dt.predict(X_test))
    plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i = i + 1

plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='black')
mean_tpr = np.mean(tprs, axis=0)
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, color='blue',
         label=r'Mean ROC (AUC = %0.2f)' % mean_auc, lw=2, alpha=1)

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.text(0.32, 0.7, 'More accurate area', fontsize=12)
plt.text(0.63, 0.4, 'Less accurate area', fontsize=12)
plt.show()

tree.png

In This Model:

  • TP: 6556 booking cancellations were correctly classified as cancellations by the model.
  • TN: 10581 booking non-cancellations were correctly classified as non-cancellations by the model.
  • FP: 2480 booking non-cancellations were incorrectly classified as cancellations by the model.
  • FN: 2325 booking cancellations were incorrectly classified as non-cancellations by the model.

Code:
models(DecisionTreeClassifier(), X_train, X_test, y_train, y_test,test_df)

tree2.png

SVC

Parameters:

C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False

Code:
svc = SVC(probability=True)
scores = []
y_preds = []
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
i = 1
fig1 = plt.figure(figsize=[12, 12])

cv = KFold(n_splits=3, random_state=47, shuffle=True)
for train_index, test_index in cv.split(x):
    X_train = x.iloc[train_index]
    X_test = x.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test = y.iloc[test_index]
    svc.fit(X_train, y_train)  # Fit the model on this fold
    prediction = svc.predict_proba(X_test)
    fpr, tpr, t = roc_curve(y_test, prediction[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    scores.append(svc.score(X_test, y_test))
    print("Training Data Accuracy:", svc.score(X_train, y_train) * 100)
    print("Test Data Accuracy:", svc.score(X_test, y_test) * 100)
    y_preds.append(svc.predict(X_test))
    plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i = i + 1

plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='black')
mean_tpr = np.mean(tprs, axis=0)
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, color='blue',
         label=r'Mean ROC (AUC = %0.2f)' % mean_auc, lw=2, alpha=1)

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.text(0.32, 0.7, 'More accurate area', fontsize=12)
plt.text(0.63, 0.4, 'Less accurate area', fontsize=12)
plt.show()

svc.png

In This Model:

  • TP: 5969 booking cancellations were correctly classified as cancellations by the model.
  • TN: 11707 booking non-cancellations were correctly classified as non-cancellations by the model.
  • FP: 1354 booking non-cancellations were incorrectly classified as cancellations by the model.
  • FN: 2912 booking cancellations were incorrectly classified as non-cancellations by the model.

Code:
models(SVC(probability=True), X_train, X_test, y_train, y_test,test_df)

svc2.png

Note:

All the visualizations and steps are also explained in the report.

i-hope-you-like-it.jpg